OCR parity: markitdownnet tuned (eng, PSM=6, DPI=300) #25

Open
mapo80 wants to merge 4 commits into main from codex/improve-.net-pipeline-for-ocr-results

mapo80 (Owner) commented Aug 21, 2025

Summary

  • expose explicit OCR tuning in MarkItDownOptions (DPI, PSM, OEM, threads, force raster)
  • add Rasterizer for uniform 300 DPI preprocessing with Otsu binarisation and light deskew
  • drive Tesseract with the new options and wire benchmarks/smoke tests via OcrBench
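
For readers unfamiliar with the Otsu binarisation step mentioned above, here is a minimal sketch of the general technique. It is written in Python purely for illustration and is not the PR's Rasterizer code (which is .NET); the function names `otsu_threshold` and `binarize` are hypothetical, and the input is assumed to be a flat list of 8-bit grayscale values.

```python
def otsu_threshold(pixels):
    """Return the 0..255 threshold that maximizes between-class variance.

    `pixels` is assumed to be a flat list of ints in 0..255 (hypothetical input shape).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w_bg = 0        # number of pixels at or below the candidate threshold
    sum_bg = 0      # intensity sum of those pixels
    best_t, best_var = 0, -1.0

    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue            # no dark-class pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break               # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

A typical use would be `binarize(page, otsu_threshold(page))` on a rasterised page before handing it to the OCR engine. Whether the Rasterizer applies a global threshold like this or a tiled variant is not stated in this description.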

Testing

  • dotnet run --project tools/OcrBench -- extract --input-dir dataset/validation --out-dir dataset/validation/_ocr --threads 1 --langs eng --psm 6 --refresh markitdownnet
  • dotnet run --project tools/OcrBench -- compare --ocr-dir dataset/validation/_ocr --out-json artifacts/validation/OCR/bench-ocr.json --out-md artifacts/validation/OCR/summary-ocr.md

https://chatgpt.com/codex/tasks/task_e_68a77ea46bdc8325bb77f0d7313776cc

mapo80 added 3 commits August 21, 2025 22:47
…data; regen markitdownnet only (eng, PSM=6, DPI=300)